Likelihood / Regression

The other side of the coin

Statistics: Interested in estimating population-level characteristics; i.e., the parameters

\[ \begin{align*} y \rightarrow f(y|\boldsymbol{\theta}) \end{align*} \]

Estimation

  • likelihood
  • Bayesian

Likelihood

Likelihood principle

Given a statistical model, all the evidence/information in a sample (\(\textbf{y}\), i.e., data) relevant to model parameters (\(\theta\)) is contained in the likelihood function.

Fisher Information

The information an observable random variable (\(\textbf{y}\)) has about an unknown parameter \(\theta\) upon which the probability of \(\textbf{y}\), \(f(\textbf{y};\theta)\), depends.
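Formally (a standard definition, stated here for completeness), Fisher information is the expected curvature of the log-likelihood; under the usual regularity conditions the two forms agree:

\[ \begin{align*} \mathcal{I}(\theta) = \mathrm{E}\left[\left(\frac{\partial}{\partial\theta}\log f(\textbf{y};\theta)\right)^{2}\right] = -\mathrm{E}\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(\textbf{y};\theta)\right] \end{align*} \]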



Information is conditional

To learn about \(\theta\) from \(\textbf{y}\), we need to link them together via a special function, \(f(\textbf{y};\theta)\)

Statistical Information


The pieces:

  • The sample data, \(\textbf{y}\)

  • A probability function for \(\textbf{y}\):

    • \(f(\textbf{y};\theta)\)

    • \([\textbf{y}|\theta]\)

  • The unknown parameter: \(\theta\)

    • specified in the probability function

The Likelihood Function

What can we say about our parameters using this function?

\[ \begin{align*} \mathcal{L}(\boldsymbol{\theta}|y) = P(y|\boldsymbol{\theta}) = f(y|\boldsymbol{\theta}) \end{align*} \]

The likelihood (\(\mathcal{L}\)) of the unknown parameters, given our data, can be calculated using our probability function.

CODE:

# A data point
  y=c(10)

# the likelihood that the mean is 8, given our data (sd = 1 by default)
  dnorm(y,mean=8)
[1] 0.05399097


If we knew the mean were truly 8, this would also be the probability density of the observation y = 10.
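As a sanity check (not in the original code), the `dnorm` value above can be reproduced by hand from the Normal density formula with \(\mu = 8\) and \(\sigma = 1\):

```r
# A data point and a guess for the mean
y  <- 10
mu <- 8

# Normal density computed directly from the formula (sd = 1)
manual <- 1 / sqrt(2 * pi) * exp(-0.5 * (y - mu)^2)
manual
# [1] 0.05399097

# dnorm gives the same value
all.equal(manual, dnorm(y, mean = 8))
# [1] TRUE
```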

Many Parameter Guesses

# Let's take many guesses of the mean
  means=seq(0,20,by=0.1)

# Use dnorm to get likelihood of each guess of the mean
# Assumes sd = 1
  likelihood=dnorm(y, mean=means)
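A quick follow-up (a sketch, not in the original slides): with a single observation and sd fixed at 1, the guess with the highest likelihood is the data point itself.

```r
# A data point
y <- c(10)

# Many guesses of the mean; sd assumed = 1
means <- seq(0, 20, by = 0.1)
likelihood <- dnorm(y, mean = means)

# The most likely guess of the mean is the observation itself
means[which.max(likelihood)]
# [1] 10
```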

Maximum Likelihood Properties

  • MLEs are consistent. As sample size increases, they will converge to the true parameter value.

  • MLEs are asymptotically unbiased. The \(E[\hat{\theta}]\) converges to \(\theta\) as the sample size gets larger.

  • No guarantee that an MLE is unbiased at small sample sizes. This can be tested (e.g., by simulation)!

  • MLEs are asymptotically efficient. As the sample size gets larger, they attain the minimum variance among unbiased estimators.

Statistics and PDF Example

What is the mean height of King Penguins?

Statistics and PDF Example

We go and collect data,

\(\boldsymbol{y} = \begin{bmatrix} 4.34 & 3.53 & 3.75 \end{bmatrix}\)


Let’s decide to use the Normal Distribution as our PDF.

\[ \begin{align*} f(y_1 = 4.34|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{1}-\mu}{\sigma})^2} \\ \end{align*} \]

AND

\[ \begin{align*} f(y_2 = 3.53|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{2}-\mu}{\sigma})^2} \\ \end{align*} \]

AND

\[ \begin{align*} f(y_3 = 3.75|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{3}-\mu}{\sigma})^2} \\ \end{align*} \]

Or simply,

\[ \textbf{y} \stackrel{iid}{\sim} \text{Normal}(\mu, \sigma) \]

\(iid\) = independent and identically distributed

Continued

The joint probability of our data with shared parameters \(\mu\) and \(\sigma\),

\[ \begin{align*} & P(Y_{1} = y_1,Y_{2} = y_2, Y_{3} = y_3 | \mu, \sigma) \\ &= \mathcal{L}(\mu, \sigma|\textbf{y}) \end{align*} \]

IF each \(y_{i}\) is independent, the joint probability of our data is simply the product of all three probability densities,

\[ \begin{align*} =& f(y_{1}|\mu, \sigma)\times f(y_{2}|\mu, \sigma)\times f(y_{3}|\mu, \sigma) \end{align*} \]

We can do this because we assume that knowing one value (\(y_1\)) gives us no new information about another value (\(y_2\)).

\[ \begin{align*} =& \prod_{i=1}^{3} f(y_{i}|\mu, \sigma) \\ =& \mathcal{L}(\mu, \sigma|y_{1},y_{2},y_{3}) \end{align*} \]
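Because products of many small densities quickly underflow numerically, it is standard to work with the log-likelihood, which turns the product into a sum:

\[ \begin{align*} \log\mathcal{L}(\mu, \sigma|\textbf{y}) &= \log \prod_{i=1}^{3} f(y_{i}|\mu, \sigma) = \sum_{i=1}^{3} \log f(y_{i}|\mu, \sigma) \end{align*} \]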

Code

Translate the math to code…

# penguin height data
  y=c(4.34, 3.53, 3.75)

#Joint likelihood of mu=3, sigma =1, given our data
  prod(dnorm(y,mean=3,sd=1))
[1] 0.01696987
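As a quick check (a sketch, not in the original code), the same joint likelihood can be computed by summing log densities and exponentiating, which is the numerically safer route used later with `optim`:

```r
# penguin height data
y <- c(4.34, 3.53, 3.75)

# product of densities vs. exponentiated sum of log densities: same answer
prod(dnorm(y, mean = 3, sd = 1))
# [1] 0.01696987
exp(sum(dnorm(y, mean = 3, sd = 1, log = TRUE)))
# [1] 0.01696987
```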


Calculate the likelihood of many guesses of \(\mu\) and \(\sigma\) simultaneously,

# The Guesses
  mu=seq(0,6,0.05)
  sigma=seq(0.01,2,0.05)
  try=expand.grid(mu,sigma)
  colnames(try)=c("mu","sigma")

# function
fun=function(a,b){
  prod(dnorm(y,mean=a,sd=b))
  }

# mapply the function with the inputs
  likelihood=mapply(a=try$mu,b=try$sigma, FUN=fun)

# maximum likelihood of parameters
  try[which.max(likelihood),]
      mu sigma
925 3.85  0.36


Likelihood plot (3D)

Sample Size

What happens to the likelihood if we increase the sample size to N=100?
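One way to see the answer (a sketch using simulated data, since the N=100 penguin sample isn't shown): the likelihood surface becomes much more sharply peaked around the true value. Here the width of the region within 2 log-likelihood units of the maximum shrinks as N grows:

```r
# Simulated heights (assumed true mean 3.9, sd 1) at two sample sizes
set.seed(1)
y3   <- rnorm(3,   mean = 3.9, sd = 1)
y100 <- rnorm(100, mean = 3.9, sd = 1)

# log-likelihood of each guessed mean (sd fixed at 1)
ll <- function(y, mu) sapply(mu, function(m) sum(dnorm(y, mean = m, log = TRUE)))
mus <- seq(2, 6, by = 0.01)

# width of the mu interval within 2 log-likelihood units of the peak
width <- function(y) diff(range(mus[ll(y, mus) > max(ll(y, mus)) - 2]))

width(y3) > width(y100)
# [1] TRUE
```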

Proper Guessing

Let’s let the computer do some smarter guessing, i.e., optimization.

#Note: optim function uses minimization, not maximization. 
#THUS, need to put negative in our function

#Note: log=TRUE, allows us to add rather than multiply 
#      (sum, instead of prod)

neg.log.likelihood=function(par){
  -sum(dnorm(y,mean=par[1],sd=par[2],log=TRUE))
  }

#find the values that minimize the function
#c(1,1) are the initial values for mu and sigma
#lower bound for sigma must be > 0 (dnorm is undefined at sd = 0)
fit <- optim(par=c(1,1), fn=neg.log.likelihood,
             method="L-BFGS-B",
             lower=c(0,0.001),upper=c(10,1))

#Maximum likelihood estimates for mu and sigma
fit$par
[1] 3.901219 0.927352
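For the Normal distribution the MLEs also have closed forms, which gives a way to check `optim`. A sketch using the three-point penguin data from earlier (note the sigma MLE divides by \(n\), not \(n-1\)):

```r
# penguin height data (the three-point sample from earlier)
y <- c(4.34, 3.53, 3.75)

# closed-form Normal MLEs: sample mean, and root mean squared deviation
mu.hat    <- mean(y)
sigma.hat <- sqrt(mean((y - mu.hat)^2))
c(mu.hat, sigma.hat)

# numerical optimization should land on the same values
neg.log.likelihood <- function(par) {
  -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
}
fit <- optim(par = c(1, 1), fn = neg.log.likelihood,
             method = "L-BFGS-B",
             lower = c(0, 0.001), upper = c(10, 2))
fit$par  # matches the closed forms to optimizer tolerance
```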

The Linear Regression way

King Penguin Height Data (N=100)

out=lm(y~1)
summary(out)

Call:
lm(formula = y ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.00136 -0.61581  0.01208  0.67407  2.58369 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.9012     0.0932   41.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.932 on 99 degrees of freedom
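The connection can be checked directly (a sketch with simulated data, since the original N=100 sample isn't shown): an intercept-only `lm()` estimates exactly the sample mean, while its residual standard error divides by \(n-1\) rather than the \(n\) of the sigma MLE.

```r
# simulated heights (assumed mean 3.9, sd 0.93, mimicking the slide output)
set.seed(1)
y <- rnorm(100, mean = 3.9, sd = 0.93)

out <- lm(y ~ 1)

# intercept estimate equals the sample mean
all.equal(unname(coef(out)), mean(y))
# [1] TRUE

# residual standard error equals sd(y), i.e., the n-1 version
all.equal(summary(out)$sigma, sd(y))
# [1] TRUE
```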

Moments

Book Chapter (Section 3.4.4.)

Probability Functions have qualities called ‘Moments’

  • Moment 1: Expected Value (mean)

  • Moment 2: Variance

  • Moment 3: Skewness

  • Moment 4: Kurtosis

  • Moment k: …

Parameters are not always moments.
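A small illustration of that last point (not from the slides): for the Exponential distribution the parameter is the rate \(\lambda\), which is not itself a moment; the mean is \(1/\lambda\) and the variance is \(1/\lambda^2\).

```r
# draw a large Exponential sample with rate = 2
set.seed(1)
y <- rexp(1e5, rate = 2)

mean(y)  # ~ 1/2, not the parameter value 2
var(y)   # ~ 1/4
```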